A SYSTEM AND METHOD FOR MOTION VECTOR GENERATION
AND ANALYSIS OF DIGITAL VIDEO CLIPS 



CROSS-REFERENCE TO RELATED APPLICATIONS 



This application claims priority to U.S. Provisional Patent Application No.
60/251,709, filed December 6, 2000.



TECHNICAL FIELD



This invention relates to a system and method for fast generation of motion vectors
from digital video frames, and for extracting the motion trajectory of each identified video
object from the generated motion vectors. Motion vector generation is often required in
digital video compression for real-time visual communications and digital video archiving.
Video-object motion trajectories are particularly useful in applications such as surveillance
monitoring, or searching distributed digital video databases to retrieve video clips relevant
to a query.

In addition, the motion-trajectory matching scheme presented in the last part of the
system is a generic solution for measuring the distance, or degree of similarity, between a
pair of digital information sequences under comparison. It can be applied directly to other
kinds of data, such as handwriting curves, musical notes, or audio patterns.



BACKGROUND



In the era of multimedia communications, the processing of digital video poses many
technical challenges because of the large amount of data involved and the limited channel
bandwidth available in practice. For example, in teleconferencing or videophone applications,
transmitting digital video (say, acquired through a digital camera) to the receiver in real
time requires compression. Compression greatly reduces the original amount of video data by
discarding redundant information while keeping the essential information as intact as
possible, so that the original video quality is maintained at the receiver side after
reconstruction. Such video processing is called digital video coding.

A basic method for compressing digital color video data to fit the available bandwidth
has been adopted by the Moving Picture Experts Group (MPEG), which produced the MPEG-1,
MPEG-2, and MPEG-4 compression standards. MPEG achieves high data compression by using
the Discrete Cosine Transform (DCT) for the intra-coded pictures (called I-frames) and
motion estimation and compensation for the inter-coded pictures (called P-frames or
B-frames). I-frames occur only every so often and are the least compressed frames; they thus
yield the highest video quality and are used as reference anchor frames. The frames between
the I-frames are P-frames and/or B-frames, generated from nearby I-frames and/or existing
P-frames. Fast motion estimation for generating motion vectors is conducted for the P-frames
and B-frames only. A typical frame structure could be IBBPBBPBBPBBPBB IBBPB..., repeated
until the last video frame. The so-called Group of Pictures (GOP) begins with an I-frame and
ends with the frame that immediately precedes the next I-frame. In the above example, the
size of the GOP is 15.

To generate motion vectors by motion estimation, each P-frame is partitioned into
smaller blocks of pixel data, typically 16 x 16 in size, called macroblocks (MB) in MPEG's
jargon. Each MB is then shifted around its neighborhood on the previous I-frame or P-frame
to find the most similar block within the imposed search range. Only the motion vector of
the most similar block is recorded and used to represent the corresponding MB. Motion
estimation for B-frames is conducted similarly but in both directions: forward prediction
and backward prediction.
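
The exhaustive block-matching step described above can be sketched as follows. This is a
minimal illustration only, assuming grayscale frames stored as nested Python lists and the
sum of absolute differences (SAD) as the matching error; the function names `sad` and
`full_search` are hypothetical and do not come from any standard.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def full_search(cur, ref, bx, by, n, search_range):
    """Test every displacement within +/- search_range and return
    (best_dx, best_dy, best_sad). Candidates that fall outside the
    reference frame are skipped."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + n > width or y0 + n > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


# Toy frames: a bright 2 x 2 patch sits at (2, 2) in the reference frame
# and at (4, 3) in the current frame.
ref = [[0] * 8 for _ in range(8)]
cur = [[0] * 8 for _ in range(8)]
for y in range(2):
    for x in range(2):
        ref[2 + y][2 + x] = 200
        cur[3 + y][4 + x] = 200

mv = full_search(cur, ref, bx=4, by=3, n=2, search_range=3)
```

Here `mv` holds the displacement from the current block back to its best match in the
reference frame. The nested loops over all displacements are what the fast search patterns
of this invention are designed to avoid.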

Note that fast motion estimation methods can be applied directly to all existing
international video-coding standards, as well as to any proprietary compression system that
adopts a similar motion-compensated video coding methodology, as they all share the approach
described above for reducing temporal redundancy. Besides MPEG, another set of video coding
standards, ITU's H.261, H.263, and H.26L, for teleconferencing or videophone applications,
also requires such motion vector generation.

The above-mentioned exhaustive search typically accounts for a large portion (about
three-quarters) of the total processing time consumed by a typical video encoder. A fast
algorithm is therefore indispensable to the realization of real-time visual communications
services. To that end, we invented a scalable fast motion estimation technique. The
scalability is useful for meeting different requirements, such as implementation, delay,
distortion, computational load and robustness, while minimizing the incidence of over-search
(which increases delay) or under-search (which might increase distortion). For example, in
a multimedia PC environment with such a scalable implementation, the user can choose among
a few video quality modes for different visual communications applications, and even for
different Internet traffic situations and/or types of service. In videophone, for instance,
a small conversational delay is probably the most important requirement and is traded off
against reduced video quality. In another application scenario, a different fast motion
estimation algorithm can be selected for creating a high-quality video email (if so desired)
to be sent later on. This is an off-line video application, for which delay is not an issue.
As another example, in so-called object-oriented video coding, where multiple video objects
are identified, activating one of the block-matching motion estimation profiles can flexibly
generate the motion vectors associated with each video object.

After the MVs are generated, certain simple statistical measurements (say, mean and
variance) of the MVs can easily be computed to yield a "content-complexity indicator"
(in the MPEG-4 video coding standard, a mnemonic called f_code). Such an indicator is
useful for capturing a summarized snapshot of each video frame. For example, based on the
category information of the f_code, one can easily locate the durations of the shots that
contain high-motion activity.
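
As a rough illustration of such statistics, the sketch below computes the mean vector and a
scalar variance of one frame's motion vectors. It assumes the variance is taken as the mean
squared distance from the mean vector, which is one possible reading; the mapping from these
values to an indicator category is not specified here and is deliberately left out. The name
`mv_statistics` is hypothetical.

```python
def mv_statistics(mvs):
    """Return ((mean_x, mean_y), variance) for a list of (dx, dy) vectors.
    Assumption: variance = mean squared distance from the mean vector."""
    n = len(mvs)
    mean_x = sum(dx for dx, _ in mvs) / n
    mean_y = sum(dy for _, dy in mvs) / n
    var = sum((dx - mean_x) ** 2 + (dy - mean_y) ** 2 for dx, dy in mvs) / n
    return (mean_x, mean_y), var


# Four motion vectors from one hypothetical frame.
frame_mvs = [(0, 0), (2, 0), (0, 2), (2, 2)]
mean, var = mv_statistics(frame_mvs)
```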

The segmented regions that correspond to each video object can form an alpha-plane
mask, which is basically a binary mask for each video object and for each individual
frame, distinguishing the object from the background. Based on such alpha-plane
information, the user can easily engage interactive hypermedia-like functionality with the
video frames. For example, the user can click on any video object of interest at any time,
say, a fast-moving racing car, and an information box will pop up and provide some
pre-stored information, such as the driver's name and age, past driving record and Grand
Prizes awarded, and other relevant information. Note that each video object has its own
associated information box, and its trajectory can serve as a reliable linkage between the
alpha-plane masks of the same video object.

The motion vectors generated as described above can be further processed for
intelligent content-based indexing and retrieval. For example, searching a large database
for relevant multimedia materials (say, video clips) and retrieving those whose content is
identical or very similar to that of the query is very desirable in many applications, such
as Internet search engines and digital libraries. Rather than relying on the conventional
approach, that is, keywords only, content-based search is fairly promising and effective for
this purpose, since "content" such as color, texture, shape, and a video object's motion
trajectory is often hard, and sometimes impossible, to describe in words. Obviously, it is
not a trivial task and, in fact, needs a suite of intelligent processes. Besides other
prominent features such as color, texture, and shape, motion trajectory is another key
feature of digital video. In this invention, the content specifically means the motion
trajectory of a video object identified from the given digital video clip. The remainder of
this invention presents a method that is capable of automatically identifying multiple
moving video objects and then simultaneously tracking them based on their respective motion
trajectories. In this scenario, the user can pose a query by drawing a curve on a computer,
say, a parabolic curve to signify a diver's diving action, in order to search for video
clips that contain such video content.

Our invention essentially provides a fundamental core technology that mimics, to a
certain degree, a human being's capability to detect moving video objects and track each
object's individual movement. A typical application that can benefit from this invention is
as follows. In a security surveillance environment, intruding moving objects can be
automatically detected, and the trajectory information can be used to steer the video
cameras to follow the movement of the video objects while recording the incidents. Another
application example can be found in digital video indexing and retrieval.



SUMMARY OF THE INVENTION

It is an object of the present invention to provide a scalable system that integrates 
several methods for performing fast block-matching motion estimation to generate motion 
vectors for video compression and/or as the required input data to conduct other video 
analysis. 

It is a further object to provide a hardware implementation architecture for realizing
the core (i.e., diamond search) of these fast block-matching motion estimation algorithms.

It is a further object to provide a method that relies solely on motion vector
information to search a video database and identify those video clips containing video
objects whose trajectories best match that of a video clip under query or a trajectory curve
drawn by the user.

It is a further object to provide a generic solution for measuring the degree of 
similarity of two chain-codes under comparison, where each chain-code is obtained from 
converting the original discretized information encountered in various applications, such as 
handwriting curves, musical notes, extracted audio tones, and so on. 

In summary, these and other objects of the present invention are achieved by a 
system comprising means for generating motion vectors using any one of available fast 
block-matching motion estimation techniques organized and integrated in a scalable fashion 
to optimally meet the demand of various tradeoffs (such as, speed-up gain, complexity, 
video quality, etc.), means for realizing the core of these motion estimation techniques for 
hardware implementation, means for smoothing noisy raw data, means for clustering the 
motion-vector data and validating the clusters so as to automatically detect the video 
objects, means for estimating motion trajectory of each detected video object, means for 
comparing each derived motion trajectory curve with respect to a database of motion 
trajectories, means for receiving a query trajectory and means for identifying video clips 
having video objects best matching the query motion trajectory. 

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is an overview of the system and method according to the present invention. 

Figs. 2(a) and 2(b) are diamond search patterns for frame-predictive motion
estimation in non-interlaced (or progressive-scanned) video.

Figs. 3(a) and 3(b) are diamond search patterns for field-predictive motion
estimation in interlaced video.

Figs. 4(a) and 4(b) are hexagon search patterns for frame-predictive motion
estimation in non-interlaced (or progressive-scanned) video. For interlaced video,
the interlaced hexagon search pattern shown in Fig. 4(c) is used for
field-predictive motion estimation.

Fig. 5 shows various types of regions of support (ROS). 

Figs. 6(a) and 6(b) illustrate the motion adaptive pattern (MAP) and the
adaptive rood pattern (ARP), respectively, which are used in the initial search to
identify the best position for the local refined search for each block.

Fig. 7 is a system architecture for a 2-D systolic array. 

Fig. 8 is an illustration of a systolic array organization. 

Fig. 9 is a typical processing element structure. 

Fig. 10 is an illustration of a barrel shifter architecture. 

Figs. 11(a) and 11(b) show the time scheduling of current-block data and
reference-block data, respectively.

Figs. 12(a) and 12(b) show the actual subscript positions with respect to the 
positions in the current and reference images, respectively. 

Fig. 13(a) is an architecture of the switching-based median filter; 13(b) is the 
hierarchical decision-making process for identifying each MV's characteristic and 
the corresponding filtering action taken. 

Fig. 14 is the architecture of the maximum entropy fuzzy clustering (MEFC) to 
achieve unsupervised identification of clusters without any a priori assumption. 

Fig. 15 shows the three main stages of the bi-directional motion tracking scheme,
comprising bi-directional projection, recursive VO tracking and validating, and
Kalman filter smoothing.



DETAILED DESCRIPTION OF THE INVENTION

The invention is best organized and described as comprising four parts, Parts A, B,
C, and D, and the entire system consisting of these four parts is illustrated in Fig. 1. Part A is
the scalable fast block-matching motion estimation method for generating motion vectors 
efficiently, with considerations of multiple factors' tradeoff, such as computational gain, 
complexity, video quality, system and application requirements. Part B presents a systolic- 
array implementation architecture for realizing the computationally-intensive core 
computation of the diamond search system described in Part A, from the hardware point of 
view. Part C is the method for the generation of motion trajectory of each detected video 
object, which consists of a series of data operations: smoothing of motion vector field, 
formation of data clusters through clustering over the smoothed field, formation of video 
objects through validation process, and motion trajectory generation of each detected video 
object. Part D is the method for matching and recognition of chain-coded information, 
including hand-drawn curves, characters, symbols, or even musical notes and extracted 
audio tones. 

Part A. Scalable Fast Block-matching Motion Estimation 

The invention in this part presents a scalable fast block-matching motion estimation system
for the generation of motion vectors (MVs), which is indispensable in certain applications,
such as video compression systems for visual communications. The scalability introduced in
this fast motion estimation system is realized in two aspects: search pattern scalability
and search distance computing scalability. For the former, multiple block-matching motion
estimation (BMME) algorithms are introduced, while for the latter, a simple downsampling
process on the pixel field is effective. Used individually or in combination, these two
scalability factors dynamically control the generation of motion vectors flexibly and
efficiently, while meeting important requirements and constraints, such as computational
gain, complexity, quality-of-service (QoS), networking dynamics and behaviors, as well as
inherent processing modes from the other parts of the parent system.

The BMME methods presented here share a common component, called the small diamond search
pattern (SDSP), in their local refined search. Furthermore, as digital video has two kinds,
non-interlaced (or progressive-scanned) and interlaced, search patterns are needed for each
of these two categories. The design of search patterns and their associated search strategy
(or procedures) is instrumental in producing faster searches and more accurate motion
vectors. The earlier-developed Diamond Search (DS) patterns, which are used in frame-based
motion estimation for non-interlaced video, are shown in Figure 2; their counterparts, the
large diamond search pattern (LDSP) and small diamond search pattern (SDSP) for field-based
motion estimation in interlaced video, are shown in Figure 3. With such a design, the entire
DS procedure of the non-interlaced case can be applied unchanged to interlaced video by
using the search patterns shown in Figure 3. In addition, the input video data do not need
any extra data re-ordering, such as separating the entire video frame into two fields, an
even field and an odd field.

Another search pattern, called the hexagon search pattern (shown in Figure 4), involves
fewer search points than the above-mentioned diamond search patterns, with possibly slight
degradation in video quality. In frame-predictive motion estimation for non-interlaced
video, if more of the motion content is along the horizontal direction, the horizontal
hexagon search pattern (HHSP) can be used; otherwise, the vertical hexagon search pattern
(VHSP) is applied. In field-predictive motion estimation for interlaced video, only one type
of hexagon pattern, called the interlaced hexagon search pattern (IHSP), is used throughout
for both the even field and the odd field, as this pattern has an inherently interlaced
structure (with every alternate line skipped in the search) and is fairly symmetrical.

In many typical videos that contain fairly large motion content (e.g., sports) and/or
peculiar motion content (e.g., cartoons, animation, video games), region-of-support (ROS)
based prediction information and/or temporal information is very helpful in producing more
accurate motion vectors. More sophisticated fast block-matching motion estimation methods
are therefore needed and are invented here. To that end, new search patterns, called the
motion adaptive pattern (MAP) and the adaptive rood pattern (ARP), are introduced. MAP is
composed of several intelligently chosen search positions, which can be formed from the
origin (0, 0) of the current macroblock (or block, a more general term used interchangeably
hereafter), the predicted motion vector of the chosen ROS (as shown in Figure 5) from the
spatial domain, temporally nearby motion vectors, and the computed global motion vector
(GMV). For example, the three motion vectors from type B of Figure 5, together with the
median-predicted motion vector, (0, 0) and the GMV, can be the six search points of MAP in
Figure 6(a). Hence, MAP has a dynamic or irregular shape established for each macroblock.
ARP, which can be viewed as a less irregular MAP, is also invented, as shown in Figure 6(b).
ARP has a rood shape with four arms that always point east, west, south and north,
respectively. The length of the rood arm, r, is adaptively computed for each block
initially, and is equal to the maximum of the city-block distance of the median-predicted
motion vector, based on the ROS chosen. For each block's motion vector generation, MAP (or
ARP, if used) is used only once, at the initial search stage, to identify the most promising
position from which to begin the local search. Once that position is found, only SDSP is
used throughout the remaining search process until the motion vector is found.
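
The construction of an adaptive rood pattern can be sketched as below. The text leaves some
details open, so this sketch makes three labeled assumptions: the predicted vector is the
component-wise median of the three type-B ROS vectors, the arm length r is the larger of the
absolute components of that predicted vector, and the pattern consists of the origin, the
four arm tips and the predicted vector itself. The actual layout is the one defined by
Fig. 6(b); the names `median_mv` and `adaptive_rood_pattern` are hypothetical.

```python
def median_mv(ros_mvs):
    """Component-wise median of an odd number of (dx, dy) vectors."""
    xs = sorted(dx for dx, _ in ros_mvs)
    ys = sorted(dy for _, dy in ros_mvs)
    mid = len(ros_mvs) // 2
    return xs[mid], ys[mid]


def adaptive_rood_pattern(ros_mvs):
    """Return (r, search_points) for the initial search of one block.
    Assumption: r = max(|pred_x|, |pred_y|) of the median-predicted MV."""
    pred = median_mv(ros_mvs)
    r = max(abs(pred[0]), abs(pred[1]))
    # Origin, east, west, south (+y), north (-y) arm tips, then the prediction.
    points = [(0, 0), (r, 0), (-r, 0), (0, r), (0, -r), pred]
    unique = []
    for p in points:  # drop duplicates while keeping the order
        if p not in unique:
            unique.append(p)
    return r, unique


r, pattern = adaptive_rood_pattern([(3, -1), (2, 0), (4, 1)])
r0, pattern0 = adaptive_rood_pattern([(0, 0), (0, 0), (0, 0)])
```

When all neighboring vectors are zero, the pattern collapses to the origin alone, so a
stationary block costs a single matching-error evaluation in the initial search.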

In our scalable block-matching fast motion estimation, each method or algorithm is called a
profile. As mentioned earlier, all the profiles share either the frame-based SDSP
(Fig. 2(b)) or the field-based SDSP (Fig. 3(b)), depending on whether non-interlaced or
interlaced video is concerned, respectively. In the following, the search pattern scalable
profiles are individually described; they are directly applicable to either frame-based or
field-based fast motion estimation.

Profile 1 (or "Simple" Profile): only the SDSP (Fig. 2(b) for frame-based and Fig.
3(b) for field-based) is used throughout the entire search.

That is, in each search stage, the search point that yields the minimum matching
error is used as the search center of the new SDSP for the next search. This
process is repeated until the center search point of the SDSP yields the minimum
matching error.
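
The Simple Profile loop can be sketched as follows. The matching error is passed in as a
callable `cost(dx, dy)` so that the sketch stays independent of any particular frame
representation; the SDSP is assumed to be the center plus its four horizontal and vertical
neighbors at distance one, and the names `SDSP` and `sdsp_search` are hypothetical.

```python
# Assumed frame-based small diamond: center plus four unit-distance neighbors.
SDSP = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]


def sdsp_search(cost, start=(0, 0), search_range=16):
    """Repeat the SDSP until its center yields the minimum matching error.
    Returns (motion_vector, matching_error)."""
    center = start
    center_cost = cost(*center)
    while True:
        best, best_cost = center, center_cost
        for ox, oy in SDSP[1:]:
            cand = (center[0] + ox, center[1] + oy)
            if max(abs(cand[0]), abs(cand[1])) > search_range:
                continue  # stay inside the search window
            c = cost(*cand)
            if c < best_cost:
                best, best_cost = cand, c
        if best == center:
            return center, center_cost
        center, center_cost = best, best_cost


# Synthetic error surface with its minimum at displacement (3, -2).
result = sdsp_search(lambda dx, dy: abs(dx - 3) + abs(dy + 2))
```

Because the center moves only when a strictly smaller error is found, the loop always
terminates inside a finite search window.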

Profile 2 (or "Basic" Profile): either LDSP (Fig. 2(a) for frame-based and Fig. 3(a) for
field-based) or the hexagon search patterns (Figs. 4(a) and 4(b) for frame-based or Fig.
4(c) for field-based) is used repeatedly until the pattern's center position yields the
minimum SAD. At that point, SDSP (Fig. 2(b) for frame-based and Fig. 3(b) for
field-based) is used only once, and the position that yields the minimum SAD is taken as
the motion vector found for that macroblock. Note that when LDSP is used, this is basically
the DS. (In fact, we can view this Basic profile as two sub-profiles: Basic-Diamond profile
and Basic-Hexagon profile.)
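
A minimal sketch of the Basic-Diamond variant follows. It assumes the commonly published
diamond search layouts (a nine-point LDSP and a five-point SDSP), which are taken here to
correspond to Figs. 2(a) and 2(b), and it abstracts the SAD as a callable `cost(dx, dy)`.
The names `best_point` and `diamond_search` are hypothetical.

```python
# Assumed layouts: nine-point large diamond and five-point small diamond.
LDSP = [(0, 0), (2, 0), (-2, 0), (0, 2), (0, -2),
        (1, 1), (1, -1), (-1, 1), (-1, -1)]
SDSP = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]


def best_point(cost, center, pattern, search_range):
    """Return (position, cost) of the lowest-cost point of `pattern`
    placed at `center`; ties keep the center."""
    best, best_cost = center, cost(*center)
    for ox, oy in pattern[1:]:
        cand = (center[0] + ox, center[1] + oy)
        if max(abs(cand[0]), abs(cand[1])) > search_range:
            continue
        c = cost(*cand)
        if c < best_cost:
            best, best_cost = cand, c
    return best, best_cost


def diamond_search(cost, search_range=16):
    """LDSP repeated until its center is the minimum, then SDSP once."""
    center = (0, 0)
    while True:
        best, _ = best_point(cost, center, LDSP, search_range)
        if best == center:
            break
        center = best
    return best_point(cost, center, SDSP, search_range)


# Synthetic error surface with its minimum at displacement (5, 2).
result = diamond_search(lambda dx, dy: (dx - 5) ** 2 + (dy - 2) ** 2)
```

A practical implementation would cache the matching errors of points shared between
successive pattern positions; the sketch recomputes them for brevity.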

Profile 3 (or "Pattern Adaptive" Profile): either SDSP or LDSP is dynamically
selected for each block at its initial search. The decision can be based on whether
LDSP was ever used during the search in the earlier-computed neighboring block(s) in
the ROS. If no LDSP was used in the ROS, only SDSP is used for the current block's
motion vector generation; otherwise, Profile 2 is activated. Alternatively, other simple
decision logic (such as majority vote) could be practiced.

Similarly, LDSP can be replaced by the hexagon search patterns. In the non-interlaced
case, for frame-based motion estimation, there are two further choices: HHSP and
VHSP, as shown in Figs. 4(a) and 4(b), respectively. The decision could depend on a
simple criterion, such as whether the largest vector component in the x- or y-direction
in the ROS is horizontal (use HHSP) or vertical (use VHSP). Furthermore, once HHSP or
VHSP is chosen, it can be applied throughout the search for the current block, or the two
patterns can be switched dynamically along the way, based on simple decision logic.



In the interlaced case for performing field-based motion estimation, similar search 
patterns (as shown in Figure 4(c)), practice and criterion can be exploited 
straightforwardly. 



Hence, this Pattern Adaptive profile can in fact be viewed as comprising two
sub-profiles: the Diamond Adaptive profile and the Hexagon Adaptive profile.

Profile 4 (or "Main" Profile): MAP (or ARP) is activated for the initial search and
performed once only. The position found to yield the minimum matching error is taken
as the starting position for the remaining local search, which uses SDSP only; that is,
the Simple Profile is enabled from there onwards until the motion vector is found.

The profiles described above demonstrate one way of categorizing relevant fast motion
estimation methods and arranging them in a scalable fashion for optimal usage. In addition,
certain aspects used in our invention are detailed as follows:

Initialization: 

The motion vectors of the blocks outside the frame are taken to be equal to (0, 0). If the
ROS of the current block is defined as the set of blocks to the left, above and above-right
of the current block (i.e., type B), for example, the corresponding motion vectors are
denoted by MV(i-1, j), MV(i, j-1) and MV(i-1, j+1), respectively. The search-point
coordinates can be directly established based on the search patterns LDSP, SDSP,
HHSP, VHSP, IHSP, MAP and ARP as shown in Figs. 2-6, respectively.

Furthermore, the global motion vector (GMV) is predicted from the motion vector field 
of the reference frame, and note that GMV is presented in MAP (and ARP) only if the 
global motion is present and detected in the reference frame. 

Determination of search range, sr : 

The mean (μx, μy) and standard deviation (σ) of the motion vectors of the reference
frame are computed. The search range (sr) for the current frame is given by

sr = max{ 16, max(|μx|, |μy|) + 3σ }.

The movement of every search pattern is restricted to the search window defined by the
search range, sr.
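
The search-range rule can be sketched as below, reading it as the larger of 16 and the
largest absolute mean component plus three standard deviations. The scalar standard
deviation is assumed here to be the root of the mean squared distance from the mean vector;
that definition, and the name `search_range`, are assumptions of this sketch.

```python
import math


def search_range(ref_mvs, floor=16):
    """Search range for the current frame from the reference frame's MVs:
    max(floor, max(|mean_x|, |mean_y|) + 3 * sigma)."""
    n = len(ref_mvs)
    mu_x = sum(dx for dx, _ in ref_mvs) / n
    mu_y = sum(dy for _, dy in ref_mvs) / n
    # Assumption: one scalar sigma over both vector components.
    var = sum((dx - mu_x) ** 2 + (dy - mu_y) ** 2 for dx, dy in ref_mvs) / n
    sigma = math.sqrt(var)
    return max(floor, max(abs(mu_x), abs(mu_y)) + 3 * sigma)


still = search_range([(0, 0)] * 4)         # no motion: the floor applies
panning = search_range([(10, 0), (30, 0)])  # mean 20, sigma 10
```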



Detection of no-motion activity 

If the matching error for the current block at the position (0, 0) is less than a threshold T, 
then the block belongs to the no-motion activity region. In this case, the search ends 
here with the motion vector for the current block equal to (0, 0). For that, we have two 
options in choosing the threshold: fixed threshold (we choose T = 512, which is quite 
robust for all kinds of video while maintaining unnoticeable quality degradation) and 
adaptive threshold described as follows. 
For the adaptive threshold, a pre-judgement threshold map (PTM) is established for each
video frame. Assume, for illustration, that the sum of absolute differences (SAD) is the
matching error criterion. Let PTM(i, j, t) be the threshold for the current block (i, j) in
frame t, and SAD(i, j, t-1) be the prediction error obtained at the same block position in
the previous frame, t-1. PTM(i, j, t) can then be established as

PTM(i, j, t) = SAD(i, j, t-1) + δ,

where δ is an adjustment parameter that allows some tolerance, for example for GMV and
for fluctuation of the prediction error among temporally neighboring blocks.
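
A minimal sketch of the adaptive threshold follows, assuming the previous frame's
per-block SAD values are kept in a two-dimensional list indexed by block row and column.
The value of the adjustment parameter is not given in the text, so the 50 used below is
purely illustrative, as are the names `build_ptm` and `is_no_motion`.

```python
def build_ptm(prev_sad, delta):
    """Threshold map for frame t: previous frame's block SAD plus delta."""
    return [[s + delta for s in row] for row in prev_sad]


def is_no_motion(sad_at_origin, threshold):
    """A block is in the no-motion region if its SAD at (0, 0) is below
    the threshold, in which case its motion vector is set to (0, 0)."""
    return sad_at_origin < threshold


# Final SADs of a hypothetical 2 x 2 grid of blocks in frame t-1.
prev_sad = [[300, 900], [120, 640]]
ptm = build_ptm(prev_sad, delta=50)  # delta value is illustrative only

skip_block_00 = is_no_motion(340, ptm[0][0])
skip_block_10 = is_no_motion(340, ptm[1][0])
```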

Determination of nonzero motion activities 

The ROS of the current block consists of its spatially and/or temporally adjacent blocks
whose motion vectors have already been determined at an earlier stage. In our invention,
the local motion vector field at the current block's position is defined as the set of
motion vectors of the blocks belonging to the ROS of the current MB. The motion activity at
the current block is defined in the present invention as a general function f of the motion
vectors in its ROS. Let the evaluated numerical value of function f at the current block
be L. We define function f as the maximum of the city-block lengths in our invention.
The motion activity at the current block is classified into different categories such as
"low", "medium", "high", based on the value of L. Let A and B be two numbers such
that A < B. Then the procedure to obtain these categories is illustrated as follows:

Motion activity = Low,    if L is less than or equal to A
                = Medium, if L is greater than A and less than or equal to B
                = High,   if L is greater than B.

We choose A = 1 and B = 2 in our invention for the full-pel case. For the half-pel and
quarter-pel cases, parameters A and B can be scaled and chosen accordingly.
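
The classification rule can be written directly as a short function. The sketch uses the
full-pel thresholds A = 1 and B = 2 stated above and takes the city-block length of a vector
as the sum of the absolute values of its components; the name `motion_activity` is
hypothetical.

```python
def motion_activity(ros_mvs, a=1, b=2):
    """Classify local motion activity from the ROS motion vectors.
    L is the largest city-block length |dx| + |dy| in the ROS."""
    length = max(abs(dx) + abs(dy) for dx, dy in ros_mvs)
    if length <= a:
        return "low"
    if length <= b:
        return "medium"
    return "high"


calm = motion_activity([(0, 0), (1, 0), (0, 0)])    # L = 1
moderate = motion_activity([(1, 1), (0, 0), (0, 1)])  # L = 2
busy = motion_activity([(2, 1), (0, 0), (0, 1)])    # L = 3
```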

Prediction of the search center 

The search center can be selected as the motion vector in the MAP of the current block
that gives the minimum matching error.

The selection of the search center can also depend on the local motion activity at the
current block position. If the motion activity is low or medium, the search center is the
origin. Otherwise, the motion vector in the ROS of the current block that gives the
minimum matching error is chosen as the search center.
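
The activity-based choice of search center can be sketched as follows. The matching error is
abstracted as a callable `cost(dx, dy)`, and the full-pel threshold B = 2 is used to separate
low or medium activity from high activity; the name `predict_search_center` is hypothetical.

```python
def predict_search_center(ros_mvs, cost, b=2):
    """Origin for low or medium activity; otherwise the ROS motion vector
    with the smallest matching error."""
    length = max(abs(dx) + abs(dy) for dx, dy in ros_mvs)
    if length <= b:
        return (0, 0)
    return min(ros_mvs, key=lambda mv: cost(*mv))


def toy_cost(dx, dy):
    """Synthetic error surface with its minimum at (3, 1)."""
    return abs(dx - 3) + abs(dy - 1)


center_high = predict_search_center([(4, 0), (3, 1), (0, 0)], toy_cost)
center_low = predict_search_center([(1, 0), (0, 1), (0, 0)], toy_cost)
```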



Search distance computing scalability: sub-sampled computation for macroblock
distance measurement

At each search point visited, the distance between the two macroblocks under comparison
must be computed and is used later for ranking. To reduce the computation effectively,
not all the pixels within the block need to be counted in the distance computation.
Hence, sub-sampled computation (say, downsampling by a factor of two in both the
horizontal and vertical directions) can be practiced. Note that the relevant thresholds
shall be adjusted accordingly, where applicable.
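
Sub-sampled distance computation can be sketched as below for two blocks stored as nested
lists. With a step of two in each direction only one quarter of the pixels contribute, so
this sketch also scales the fixed no-motion threshold of 512 by the same factor; that
particular scaling is an assumption, as is the name `sad_subsampled`.

```python
def sad_subsampled(cur_block, ref_block, step=2):
    """SAD over every `step`-th pixel in both directions."""
    return sum(
        abs(cur_block[y][x] - ref_block[y][x])
        for y in range(0, len(cur_block), step)
        for x in range(0, len(cur_block[0]), step)
    )


# Two 4 x 4 blocks that differ by 3 at every pixel.
cur_block = [[10] * 4 for _ in range(4)]
ref_block = [[7] * 4 for _ in range(4)]

full = sad_subsampled(cur_block, ref_block, step=1)     # all 16 pixels
quarter = sad_subsampled(cur_block, ref_block, step=2)  # 4 of 16 pixels

# Assumed adjustment: scale the threshold by the fraction of pixels used.
scaled_threshold = 512 // (2 * 2)
```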

Updating "f_code"

The "f_code" is a special code used in the motion estimation part of the international
video coding standard MPEG-4. The motion activity information computed as described above
can be used to update the f_code, for the purpose of video indexing and other
multimedia applications. Since the global motion activity information controls the
search range parameter, the search range can in turn update the f_code.

While the above can be used in the present invention, various changes can be made, 
for example, instead of the above-mentioned search patterns, any other symmetric search 
patterns might be used. Also, in determining the no-motion activity, instead of comparing 
the matching error of the current block with a fixed threshold, any other matching metric of 
the current block may be compared with a threshold. Likewise, when using adaptive 
threshold, exploiting a memory map of the previous frame for the current frame should be 
considered as a redundant practice. The function f might be any function of its member
motion vectors. For example, the function may evaluate the maximum of the lengths of the 
motion vectors or the area enclosed by the motion vectors, etc. The motion activity can be 
classified into more, or less, than the categories mentioned, and the methods for selection of 
the search center and search strategies can be used in any other combinations other than 
those described above. All the above-mentioned can be directly applied to video 'frames' or 
'fields' in the context. 



Part B. A Method and Apparatus of 2-D Systolic Array 
Implementation for Diamond Search Motion Estimation 

This part utilizes a systolic array architecture to implement the diamond search fast 
motion estimation so as to speed up the motion vector generation process. As illustrated 
in Fig. 7, the proposed system architecture of this component comprises the following 
parts: (1) 2-D Systolic Array, (2) Memories, (3) Control Logic Unit, and (4) 
Comparison Unit. 

The 2-D Systolic Array consists of a planar arrangement of multiple processing
elements (PEs), which perform the arithmetic computations to obtain the sum of
absolute differences (SAD) value for each checking point in the diamond search motion
estimation method. The results are sent to the Comparison Unit to decide the final motion
vector. Memory 1 and Memory 2 store the current-block data (Cur) and the reference-block
data (Ref) to be compared, respectively. The Control Logic Unit generates the memory
addresses and controls the PE operations in the systolic array.

The 2-D systolic array diagram is shown in Fig. 8. The current-block data Cur, 
and the reference-block data Ref are inputted to the array from its left line and bottom 
line, respectively. The resulted SAD values are outputted from the top line of the array. 

The whole array consists of P x 3 PEs, arranged in P rows and 3 columns, 
where P is the width of the current block (in the following, P = 16 for demonstration). In 
each PE, the difference, the absolute-value operation and the summation are performed 
sequentially. Fig. 9 shows the block diagram of the PE structure, where c, r and m 
represent Cur, Ref and SAD, respectively. 
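
The arithmetic performed by one PE can be illustrated in software. The following is a minimal sketch, not the hardware design itself: the function names `pe_step` and `block_sad` and the list-of-rows block layout are assumptions made for illustration, and the loop is sequential rather than systolic.

```python
def pe_step(c, r, m):
    """One PE operation: difference, absolute value, then accumulation.

    c is a current-block pixel, r a reference-block pixel, and m the
    partial SAD arriving at the PE (names follow Fig. 9).
    """
    return m + abs(c - r)


def block_sad(cur, ref):
    """Sum of absolute differences between two equally sized blocks.

    cur and ref are lists of rows of pixel intensities (assumed layout).
    """
    sad = 0
    for cur_row, ref_row in zip(cur, ref):
        for c, r in zip(cur_row, ref_row):
            sad = pe_step(c, r, sad)
    return sad


cur = [[10, 12], [11, 13]]
ref = [[9, 15], [11, 10]]
print(block_sad(cur, ref))  # 1 + 3 + 0 + 3 = 7
```

In the array itself, the P x 3 PEs evaluate these accumulations concurrently; the sketch only reproduces the resulting value.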

Memory 1 is composed of P modules, each containing Q pixels, where Q 
is the height of the current block (Q = 16 for a normal macroblock). Memory 2 has P + 8 
modules, which contain all the reference-block data for the surrounding checking points 
of one large diamond search pattern (LDSP) as described above, so that no memory swap 
is required when the checking point moves from one large diamond search (LDS) to 
another LDS. 

Each module of Memory 2 contains Q + 8 pixels, i.e., 24 x 8 bits for normal motion estimation. 
To supply the reference-block data to the boundary PEs, two barrel shifters are 
employed to interface the memory and the boundary PEs, wherein each shifter contains P 
+ 8 registers. With the aid of the shifters, the data from the corresponding modules are 
accessed by left-shift or right-shift operations when the checking point to be 
processed moves horizontally from one position to another. The interface connections among 
the memory, the barrel shifters and the systolic array are shown in Fig. 10. 

The Control Logic Unit generates all the required control signals for the 
memories, the array and the comparison logic. An accurate timing mechanism must be 
provided to coordinate the data flow of all the operations. 

Figure 11 shows the time scheduling of the current-block data and the 
reference-block data when the LDS is performed in the systolic array. The actual 
positions that the subscripts represent in the current and reference images are illustrated in 
Fig. 12. As shown in Fig. 11, the current-block data are fed into the array in a 
pipelined mode, whereas the reference-block data are supplied in parallel. Notice 
that two idle cycles (slot 1 and slot 2 in Fig. 11) are required in order to initiate the PE 
operations. 

The Comparison Unit compares the SAD results from the three PE columns 
individually and chooses the motion vector at which the minimal SAD value occurs in the 
diamond search. The generated motion vector is fed into the Control Logic Unit to 
guide the next search position, and the above-mentioned steps are repeated. 
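
The role of the Comparison Unit can be sketched in software as a minimum-SAD selection over the checking points of one search step. This is an illustrative sketch under stated assumptions: `LDSP_OFFSETS` uses the customary nine-point large diamond (the center plus eight points), and the helper names and the skipping of out-of-frame candidates are choices made here, not details taken from the hardware description.

```python
# Assumed nine-point large diamond search pattern, as (dx, dy) offsets.
LDSP_OFFSETS = [(0, 0), (0, -2), (1, -1), (2, 0), (1, 1),
                (0, 2), (-1, 1), (-2, 0), (-1, -1)]


def sad_at(cur_block, ref_frame, top, left):
    """SAD between cur_block and the reference block whose corner is (top, left)."""
    total = 0
    for y, row in enumerate(cur_block):
        for x, c in enumerate(row):
            total += abs(c - ref_frame[top + y][left + x])
    return total


def best_checking_point(cur_block, ref_frame, top, left, offsets=LDSP_OFFSETS):
    """Return ((dx, dy), sad) for the checking point with the smallest SAD.

    Candidates falling outside the reference frame are skipped; ties keep
    the earliest candidate in the offsets list.
    """
    height, width = len(cur_block), len(cur_block[0])
    best = None
    for dx, dy in offsets:
        t, l = top + dy, left + dx
        if t < 0 or l < 0 or t + height > len(ref_frame) or l + width > len(ref_frame[0]):
            continue
        sad = sad_at(cur_block, ref_frame, t, l)
        if best is None or sad < best[1]:
            best = ((dx, dy), sad)
    return best


# A 2 x 2 pattern placed at row 3, column 4 of an otherwise empty 8 x 8 frame.
ref_frame = [[0] * 8 for _ in range(8)]
ref_frame[3][4], ref_frame[3][5] = 50, 60
ref_frame[4][4], ref_frame[4][5] = 70, 80
cur_block = [[50, 60], [70, 80]]
print(best_checking_point(cur_block, ref_frame, top=3, left=2))  # ((2, 0), 0)
```

The returned offset plays the part of the motion vector fed back to the Control Logic Unit to set the next search center.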



Part C. A Method for Extracting Motion Trajectories of Moving 
Video Objects based on Motion Vectors 

The method for extracting motion trajectories of moving video objects (VOs) based on 
macroblock motion vectors (MVs) comprises three phases: 1) Motion-vector field 
denoising, 2) Unsupervised clustering and 3) Bi-directional motion tracking. 

1. Motion-vector field denoising 

The motion-vector field, extracted directly from MPEG or H.26x bitstreams or generated 
using the techniques described in Part A, is first filtered by a proposed noise adaptive 
soft-switching median (NASM) filter whose architecture is shown in Figure 13(a). The NASM 
filter contains a switching mechanism steered by a three-level decision-making process that 
classifies each MV as one of the four identified MV types outlined in Figure 13(b). 
Subsequently, the appropriate filtering actions are invoked accordingly. 

1.1 Soft-switching decision scheme 

The first level involves the identification of true MVs. A standard vector median (SVM) 
filter with an adaptive window size of $W_{d1} \times W_{d1}$ is applied to obtain a smoothed MV 
field. The MV-wise differences $\Delta_i$ between the original MV field and the smoothed MV field 
are computed. True MVs are identified as the ones with much smaller differences. To 
adapt to different amounts of irregular MVs, the steps are carried out twice: first to estimate the 
percentage of irregularity q using a 7 x 7 SVM filter, and then with the 
appropriate window size selected by referring to Table 1. 
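
For readers unfamiliar with the vector median, the following sketch shows one common definition: the output is the input vector whose summed Euclidean distance to all other vectors in the window is smallest. The function names and the border handling (the window is simply clipped at the field boundary) are assumptions for illustration and may differ from the filter used in the invention.

```python
import math


def vector_median(vectors):
    """Return the vector with the smallest summed distance to all others."""
    def total_distance(v):
        return sum(math.hypot(v[0] - w[0], v[1] - w[1]) for w in vectors)
    return min(vectors, key=total_distance)


def svm_filter(field, window=3):
    """Apply a window x window vector median to a 2-D field of (dx, dy) MVs."""
    half = window // 2
    rows, cols = len(field), len(field[0])
    out = []
    for y in range(rows):
        out_row = []
        for x in range(cols):
            # Collect the neighbourhood, clipped at the field borders.
            neighbours = [field[j][i]
                          for j in range(max(0, y - half), min(rows, y + half + 1))
                          for i in range(max(0, x - half), min(cols, x + half + 1))]
            out_row.append(vector_median(neighbours))
        out.append(out_row)
    return out


field = [[(1, 0)] * 3 for _ in range(3)]
field[1][1] = (9, 9)          # one irregular MV in a uniform field
smoothed = svm_filter(field)
print(smoothed[1][1])         # (1, 0)
```

The difference between `field[1][1]` and `smoothed[1][1]` is large for the irregular MV and zero elsewhere, which is the quantity the first decision level thresholds.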

Two optimal partition parameters $p_l$ and $p_u$ are derived as two boundary positions. All 
MVs with $\Delta_i$ falling within this range are considered true MVs. Denote $x_0 < x_1 < \dots < x_m$ 
as the bin indices of the error histogram of $\Delta_i$. Each $n_i$ (for i = 0, 1, ..., m) indicates the 
number of elements falling in bin i. Parameter $p_u$ is given by 




(EQU 1) 



A similar analysis is repeated for the negative part of the distribution. Let the bin indices be 
$x_{-m} < x_{-m+1} < \dots < 0$, where $n_i$ represents the number of elements in bin i. Parameter $p_l$ is given by 



(EQU 2) 



The percentage of irregularities q is conservatively determined by subtracting the 
percentage of true MVs from one hundred percent. 

The second level involves the identification of isolated irregular MVs. Given an MV as the 
center MV within a $W_{d2} \times W_{d2}$ decision window, the membership values $\mu_{s,t}$ of its 
neighboring $MV_{s,t}$ within the decision window are defined as 



for $-(W_{d2}-1)/2 \le s, t \le (W_{d2}-1)/2$ and $(s, t) \ne (0, 0)$. Parameters $d_{s,t}$ and $d_{u,v}$ are the 
magnitude-wise differences of $MV_{s,t}$ and $MV_{u,v}$ with respect to the center MV. 
Parameters u and v have the same value range as s and t, i.e., $-(W_{d2}-1)/2 \le u, v \le (W_{d2}-1)/2$ 
and $(u, v) \ne (0, 0)$. Starting with $W_{d2} = 3$, the decision window repeatedly extends 
outwards by one unit on all four window sides as long as the number of true MVs 
is less than $(W_{d2} \times W_{d2})/2$, or until $W_{d2} = W_{d1}$. That is, parameter $W_{d2}$ is an odd integer 
that satisfies $3 \le W_{d2} \le W_{d1}$. 

The mean of $\mu_{s,t}$ is used to divide the membership map $\mu_{s,t}$ into two groups, a lower-value 
group and a higher-value group, denoted by $\mu_{low}$ and $\mu_{high}$. The decision rule for detecting an isolated 
irregular MV is defined by: 

(i) If $\mu_{low} / \mu_{high} < 3$, the center MV is claimed to be an isolated irregular MV. 

(ii) If $\mu_{low} / \mu_{high} > 3$, further discrimination at the third level is required. 

The third level classifies the considered center MV as either a non-isolated irregular 
MV or an edge MV. The algorithm checks each side of the $W_{d2} \times W_{d2}$ window 
boundary obtained in level two. If any MV closely correlated to 
the center MV lies on one of the four boundaries, that boundary is 
extended by one pixel position to obtain an enlarged window. Denote $N_c$ as 
the number of "closely correlated MVs" within the enlarged window. The decision rules 
for discriminating a non-isolated irregular MV from an edge MV are: 

(i) If $N_c < S_{in}$, the considered MV is a non-isolated irregular MV; otherwise, 

(ii) If $N_c \ge S_{in}$, the considered MV is an edge MV. 




(EQU 3) 



Threshold $S_{in}$ is conservatively defined to be half of the total number of uncorrupted MVs 
within the enlarged window. 

1.2 Filtering scheme 

Identified true MVs are left unaltered in order to preserve the fine details of the MV 
field. The standard vector median (SVM) filter and an invented fuzzy weighted vector median 
(FWVM) filter are applied to irregular MVs and edge MVs, respectively. For the 
proposed FWVM filter, the fuzzy membership values $\mu_{s,t}$ computed earlier are re-used 
to determine the weights of the true MVs within a $W_f \times W_f$ filtering window. The weighting 
factors of those true MVs are defined to be 



where $X = \sum \mu_{s,t} + \mu_c$ and $\mu_c / X$ is the weighting factor assigned to the center MV. 
Parameter $\mu_c$ is optimally determined by minimizing the output data variance so that 
the noise attenuation is maximized, which is given by 



2. Maximum Entropy Fuzzy Clustering 

The NASM-filtered MVs are then grouped around an optimum number of cluster centers by 
our invented unsupervised maximum entropy fuzzy clustering (MEFC), which segments the MV 
field into homogeneous motion regions. Figure 14 shows the architecture of the MEFC. 

2.1 Outer loop 

The outer loop recursively increases the number of clusters c from 1 until it reaches a 
pre-determined maximum value $c_{max}$, i.e., $c = 1, 2, \dots, c_{max}$. In each outer-loop iteration, 
a new cluster center is initialized to split the largest cluster into two smaller clusters 
based on the measured fuzzy hypervolume. Denote the input MVs as $\{ x_i \mid x_i \in \Re^s, i = 1, 2, \dots, N \}$ 
and the corresponding cluster centers as $\{ c_j \mid c_j \in \Re^s, j = 1, 2, \dots, c \}$. 
Initially, all data samples are considered to belong to one dominant cluster. That is, c = 1 
and $c_1^{(0)} = \sum_{i=1}^{N} x_i / N$. This dominant cluster is then optimally split into two according to 




(EQU4) 




for $(s,t) \ne (0,0)$. (EQU 5) 




(EQU 6) 



In the subsequent iterations, each new cluster center is initialized according to 

$c^{(0)} = \{ x_i \in X \mid \max d(x_i, c_{LH}) \text{ and } \mu_{i,LH} > \xi \}$, (EQU 7) 

where $\xi$ is a pre-determined confidence limit for claiming that a data sample is strongly 
associated with the cluster center $c_{LH}$ of the largest cluster. 
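
One way to read this initialization rule is sketched below: among the samples whose membership in the largest cluster exceeds the confidence limit, the one farthest from that cluster's center seeds the new cluster. The function name `init_new_center`, the argument layout and the default limit of 0.5 are hypothetical; the sketch is an interpretation, not the definitive rule.

```python
import math


def init_new_center(samples, memberships_lh, center_lh, xi=0.5):
    """Pick the seed of a new cluster from the largest cluster.

    samples        : list of feature vectors (tuples)
    memberships_lh : membership of each sample in the largest cluster
    center_lh      : center of the largest cluster
    xi             : confidence limit (assumed default, for illustration)
    Returns the strongly associated sample farthest from center_lh,
    or None if no sample exceeds the confidence limit.
    """
    candidates = [x for x, mu in zip(samples, memberships_lh) if mu > xi]
    if not candidates:
        return None
    return max(candidates, key=lambda x: math.dist(x, center_lh))


samples = [(0, 0), (1, 0), (5, 0), (9, 0)]
memberships = [0.9, 0.8, 0.7, 0.2]
print(init_new_center(samples, memberships, center_lh=(0, 0)))  # (5, 0)
```

The sample at (9, 0) is farther away but is excluded because its membership falls below the limit, which is the purpose of the confidence condition.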

2.2 Inner loop 

The inner loop recursively updates the membership values and the cluster centers so that 
the newly initialized and existing cluster centers converge to their respective new optimum 
positions. The process involves (i) updating the membership values of all the data samples 
with respect to the cluster centers determined in the previous iteration, and (ii) 
computing the new cluster centers' positions based on the membership values computed 
in the current iteration. That is, denote $U = [\mu_{ij}]_{N \times c}$ in the fuzzy membership domain $M_f^{N \times c}$ 
and $C = [c_j]_{c \times s}$ in the feature space $\Re^{c \times s}$. The inner process can be presented as recursively 
updating the following steps: 

$U = F(C)$, where $F: \Re^{c \times s} \to M_f^{N \times c}$, 
$C = G(U)$, where $G: M_f^{N \times c} \to \Re^{c \times s}$. (EQU 8) 

These two steps alternately update each other until a convergence condition is met, that 
is, $| U^{(t+1)} - U^{(t)} | < k$, where k is a small value. 

The membership function $\mu_{ij}$ of the MEFC is derived by maximizing the entropy 
subject to minimizing a fuzzy weighted distortion measurement. The membership 
function is derived to be 

(EQU 9) 

Parameter $\beta_i$ is the Lagrange multiplier introduced in the derivation and is termed the 
"discriminating factor". The optimal value of $\beta_i$ for $i = 1, 2, \dots, N$ is obtained to be 

(EQU 10) 

where $\varepsilon$ is a small value and $d_{i(min)}$ is the distance of each $x_i$ from its nearest cluster 
center $c_p$, i.e., $d(x_i, c_p) \le d(x_i, c_q)$ for $q = 1, 2, \dots, c$ and $q \ne p$. 

The updating expression for the cluster centers $c_j$ is given by 

(EQU 11) 
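
The alternating structure of the inner loop can be sketched as follows. Several parts are assumptions rather than the formulas of this specification: the membership uses the softmax (Gibbs) form that maximum-entropy derivations commonly yield, the center update is a membership-weighted mean, and a single fixed `beta` stands in for the per-sample discriminating factor. The sketch is meant only to show how U = F(C) and C = G(U) alternate until the change in U falls below k.

```python
import math


def update_memberships(samples, centers, beta):
    """U = F(C): assumed softmax membership, each row sums to one."""
    U = []
    for x in samples:
        weights = [math.exp(-beta * math.dist(x, c)) for c in centers]
        total = sum(weights)
        U.append([w / total for w in weights])
    return U


def update_centers(samples, U, n_clusters):
    """C = G(U): assumed membership-weighted mean of the samples."""
    dim = len(samples[0])
    centers = []
    for j in range(n_clusters):
        weight_sum = sum(row[j] for row in U)
        centers.append(tuple(
            sum(row[j] * x[d] for row, x in zip(U, samples)) / weight_sum
            for d in range(dim)))
    return centers


def inner_loop(samples, centers, beta=2.0, k=1e-6, max_iter=200):
    """Alternate the two updates until max |U(t+1) - U(t)| < k."""
    U = update_memberships(samples, centers, beta)
    for _ in range(max_iter):
        centers = update_centers(samples, U, len(centers))
        new_U = update_memberships(samples, centers, beta)
        change = max(abs(a - b) for ra, rb in zip(new_U, U) for a, b in zip(ra, rb))
        U = new_U
        if change < k:
            break
    return U, centers


samples = [(0, 0), (0, 1), (10, 0), (10, 1)]
U, centers = inner_loop(samples, centers=[(1, 0), (9, 0)])
print([tuple(round(v, 3) for v in c) for c in centers])
```

With two well-separated groups of samples, the two centers settle near the mean of each group.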

To identify the optimum number of clusters c, the cluster validity $V_c$ is formulated in terms of 
intra-cluster compactness and inter-cluster separability to measure the clustering 
performance for each value of c. The cluster's compactness is defined as 

$P_{Dj} = S_j / F_{HVj}$, (EQU 12) 

where $S_j = \sum_{i=1}^{N} \mu_{ij} / N$, $F_{HVj} = [\det(F_j)]^{1/2}$ and $F_j$ is the covariance matrix of the j-th cluster. 
For measuring inter-cluster separability, the principle of minimum entropy is exploited to 
give 

$E_j = -\sum_{i=1}^{N} \mu_{ij} \log \mu_{ij}$. (EQU 13) 

Since we aim to maximize $P_{Dj}$ and minimize $E_j$ for cluster number c, the cluster 
validity measurement is defined to be 

(EQU 14) 

The formulated cluster validity $V_c$ allows the clustering 
performance to be evaluated for each cluster number c. The MVs are segmented into an optimum number 
of regions, since the optimal cluster number is the one that gives the maximum 
value of $V_c$. 
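
The two ingredients of the validity measure can be computed as in the sketch below for 2-D motion vectors. The membership-weighted covariance used for $F_j$ is an assumed definition, the function name `cluster_measures` is hypothetical, and the sketch deliberately stops short of combining the two quantities into $V_c$.

```python
import math


def cluster_measures(samples, U, centers, j):
    """Return (partition density P_Dj, entropy E_j) for cluster j.

    samples : list of 2-D vectors (x, y)
    U       : membership matrix, U[i][j] for sample i and cluster j
    centers : list of 2-D cluster centers
    A degenerate cluster (zero covariance determinant) would divide by zero.
    """
    n = len(samples)
    mu = [row[j] for row in U]
    total = sum(mu)
    s_j = total / n
    cx, cy = centers[j]
    # Membership-weighted 2 x 2 covariance (assumed definition of F_j).
    sxx = sum(m * (x - cx) ** 2 for m, (x, y) in zip(mu, samples)) / total
    syy = sum(m * (y - cy) ** 2 for m, (x, y) in zip(mu, samples)) / total
    sxy = sum(m * (x - cx) * (y - cy) for m, (x, y) in zip(mu, samples)) / total
    f_hv = math.sqrt(sxx * syy - sxy * sxy)   # fuzzy hypervolume
    p_d = s_j / f_hv
    entropy = -sum(m * math.log(m) for m in mu if m > 0)
    return p_d, entropy


samples = [(0, 0), (2, 0), (0, 2), (2, 2)]
U = [[1.0], [1.0], [1.0], [1.0]]
p_d, entropy = cluster_measures(samples, U, centers=[(1, 1)], j=0)
print(p_d, entropy)
```

A compact cluster yields a small hypervolume and therefore a large partition density, while crisp memberships (close to 0 or 1) yield a small entropy.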



3. Bidirectional motion tracking for motion trajectory extraction 

A bidirectional motion tracking process is then performed to form valid VOs from the 
segmented homogeneous motion regions. The bi-directional motion tracking is structured 
into three successive steps, involving bi-directional projection, motion trajectory 
extraction and Kalman filter smoothing, as shown in Figure 15. 

3.1 Bi-directional projection 

Validated VOs $O_k(n - n_p)$ from the previous P-frame and segmented regions $R_i(n + n_f)$ from the 
future P-frame are bi-directionally projected onto the current frame based on a second-order 
kinematics model. The motion characteristics of $O_k(n - n_p)$ and $R_i(n + n_f)$ are assumed to be 
constant during the projection process. Thus, by forwardly projecting $O_k(n - n_p)$ onto the current 
frame, the resulting displacements in the right and down directions can be respectively 
expressed by 

$D_r^p = v_r^p \times n_p$, (EQU 15) 
$D_d^p = v_d^p \times n_p$. (EQU 16) 

The velocities of $O_k(n - n_p)$ in both directions are given by 

$v_r^p = \overline{MV}_r^p / n_{Ref}$ pixel/frame, (EQU 17) 
$v_d^p = \overline{MV}_d^p / n_{Ref}$ pixel/frame, (EQU 18) 

where $\overline{MV}_r^p$ and $\overline{MV}_d^p$ are the mean MVs of $O_k(n - n_p)$, and $n_{Ref}$ is the number of frames from 
the reference frame. By the same principle, the displacements in the right and down 
directions for $R_i(n + n_f)$ in the future frame are expressed as 

$D_r^f = -v_r^f \times n_f$, (EQU 19) 
$D_d^f = -v_d^f \times n_f$. (EQU 20) 
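
The projection arithmetic amounts to converting a mean MV into a per-frame velocity and scaling it by the frame gap, with the sign reversed for a region projected back from the future frame. The sketch below assumes that reading; the function name and argument names are hypothetical.

```python
def projected_displacement(mean_mv, n_ref, n_frames, forward=True):
    """Displacement (right, down) of a region projected onto the current frame.

    mean_mv  : mean MV of the region as (right, down), in pixels
    n_ref    : number of frames between the region's frame and its reference frame
    n_frames : number of frames between the region's frame and the current frame
    forward  : True for a VO from the previous frame, False for a region
               projected backward from the future frame (sign is reversed)
    """
    v_right = mean_mv[0] / n_ref      # pixel/frame
    v_down = mean_mv[1] / n_ref       # pixel/frame
    sign = 1 if forward else -1
    return (sign * v_right * n_frames, sign * v_down * n_frames)


print(projected_displacement((6, -3), n_ref=3, n_frames=3))                 # (6.0, -3.0)
print(projected_displacement((4, 2), n_ref=2, n_frames=3, forward=False))   # (-6.0, -3.0)
```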


3.2 Motion trajectory extraction 

Each segmented region obtained after the MEFC process may be a valid VO, a section 
of a valid VO, or even a large region that encompasses a few VOs. To identify the 
semantic meaning conveyed by each segmented region (i.e., unconnected or connected 
region), our strategy is to identify the various possible scenarios that have caused the 
generation of the segmented regions. 

For unconnected regions, let event A describe the interaction between the segmented region 
$R_i(n)$ of the current frame and the projected VO(s) $O_k(n - n_p)$ from the previous frame, and let 
event B describe its interaction with the projected region(s) $R_i(n + n_f)$ from the future frame, given by 

$A = \{A_1, A_2, A_3\}$, (EQU 21) 

and 

$B = \{B_1, B_2, B_3\}$, (EQU 22) 

where 

$A_1$ = event "The considered unconnected region overlaps with one 
projected VO's motion mask from the previous frame," 

$A_2$ = event "The considered unconnected region overlaps with multiple 
projected VOs' motion masks from the previous frame," 

$A_3$ = event "The considered unconnected region overlaps with none 
of the projected VOs' motion masks from the previous frame," 

$B_1$ = event "The considered unconnected region overlaps with one 
projected homogeneous-motion region from the future frame," 

$B_2$ = event "The considered unconnected region overlaps with multiple 
projected homogeneous-motion regions from the future frame," 

$B_3$ = event "The considered unconnected region overlaps with none 
of the projected homogeneous-motion regions from the future frame." 

The actions to be taken for the various combinations of events (A, B) are grouped into four cases, 
as tabulated in Table 2. In Case 1, $R_i(n)$ is mapped to $O_k(n - n_p)$. In Case 2, $R_i(n)$ is mapped 
to the projected VO that gives the minimum discrepancy in motion direction. In Case 3, 
$R_i(n)$ is identified to be a new VO. In Case 4, region $R_i(n)$ is spurious noise and is 
subsequently discarded. 

Connected regions interact with the projected $O_k(n - n_p)$ from the previous frame and 
$R_i(n + n_f)$ from the future frame in the same way as unconnected regions, as described by 
events C and D as follows: 

$C = \{C_1, C_2, C_3\}$, (EQU 23) 

where 

$C_1$ = event "Both the considered connected regions are associated with the same 
projected VO's motion mask from the previous frame," 

$C_2$ = event "Both the considered connected regions are associated with two different 
projected VOs' motion masks from the previous frame," 

$C_3$ = event "Both the considered connected regions are associated with none 
of the projected VOs' motion masks from the previous frame," 

$D_1$ = event "Both the considered connected regions are associated with the same 
projected homogeneous-motion region from the future frame," 

$D_2$ = event "Both the considered connected regions are associated with two different 
projected homogeneous-motion regions from the future frame," 

$D_3$ = event "Both the considered connected regions are associated with none 
of the homogeneous-motion regions from the future frame." 

Table 3 summarizes the actions to be taken for the different combinations of events (C, D). In 
Case 5, the connected regions are merged together to form a valid VO, 
i.e., $O_k(n) = \bigcup_i R_i(n)$. In Case 6, the connected regions are split into separate and 
independent VOs, which are mapped separately to different projected VOs $O_k(n - n_p)$. In Case 
7, the connected regions are merged together to form a new VO. In Case 8, more information 
from future frames is required to determine whether the connected regions are (i) different 
parts of a valid VO or (ii) independent valid VOs that are initially located close to each other 
and will eventually separate into independent regions. In Case 9, region $R_i(n)$ is identified 
to be spurious noise and is therefore discarded, as in Case 4. 
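
The case selection in Tables 2 and 3 is a plain table lookup, which can be sketched as below. The event indices 1 to 3 stand for the subscripts of A, B, C and D; the dictionary and function names are hypothetical, and the action strings paraphrase the text above.

```python
# Rows are indexed by A (or C), columns by B (or D), following Tables 2 and 3.
UNCONNECTED_CASES = {1: (1, 1, 1), 2: (2, 2, 2), 3: (3, 3, 4)}
CONNECTED_CASES = {1: (5, 5, 5), 2: (6, 6, 6), 3: (7, 8, 9)}

ACTIONS = {
    1: "map the region to the overlapping projected VO",
    2: "map the region to the projected VO with minimum motion-direction discrepancy",
    3: "declare the region a new VO",
    4: "discard the region as spurious noise",
    5: "merge the connected regions into a valid VO",
    6: "split the connected regions and map them to different projected VOs",
    7: "merge the connected regions into a new VO",
    8: "defer the decision until future frames are available",
    9: "discard the region as spurious noise",
}


def unconnected_case(a, b):
    """Case number for events (A_a, B_b) of an unconnected region."""
    return UNCONNECTED_CASES[a][b - 1]


def connected_case(c, d):
    """Case number for events (C_c, D_d) of connected regions."""
    return CONNECTED_CASES[c][d - 1]


print(unconnected_case(3, 3), ACTIONS[unconnected_case(3, 3)])
print(connected_case(3, 2), ACTIONS[connected_case(3, 2)])
```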

A check for abruptly missing VOs is also performed. If a VO goes missing, its mask from the 
previous frame is forward projected onto the current frame based on the second-order kinematics 
model to estimate its new position in the current frame. 

Subsequently, the motion trajectories of the VOs are estimated by taking the centroid of each 
VO in each frame, i.e., 

(EQU 24) 

where 

(EQU 25) 

and $C_{O_j(i)}$ is the centroid of VO $O_j(i)$ at frame i. 

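A trajectory built from per-frame centroids can be sketched as follows, assuming each VO is available as a binary mask per frame. The mask representation and the function names are assumptions for illustration.

```python
def centroid(mask):
    """Centroid (x, y) of the non-zero cells of a binary mask, or None if empty."""
    points = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not points:
        return None
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def trajectory(masks):
    """Sequence of centroids of one VO, one entry per frame."""
    return [centroid(m) for m in masks]


frames = [
    [[0, 1, 1],
     [0, 1, 1]],
    [[1, 1, 0],
     [1, 1, 0]],
]
print(trajectory(frames))  # [(1.5, 0.5), (0.5, 0.5)]
```
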
3.3 Kalman filter smoothing 

In the last stage, the obtained motion trajectories are smoothed by Kalman filtering. The 
following shows the formulation of the problem as state-space equations to be fed into the 
iteration process of Kalman filtering. The trajectory of the target VO in the two-dimensional 
Cartesian plane at time nT, where 1/T is the frame rate, is defined as 

(EQU 26) 

The displacement, velocity and acceleration vectors are defined as 

$C(n+1) = C(n) + T\dot{C}(n) + \frac{1}{2}T^2\ddot{C}(n) + \eta_p(n)$, (EQU 27) 

$\dot{C}(n+1) = \dot{C}(n) + T\ddot{C}(n) + \eta_v(n)$, (EQU 28) 

$\ddot{C}(n+1) = \ddot{C}(n) + \eta_a(n)$, (EQU 29) 

where $\eta_p(n)$, $\eta_v(n)$ and $\eta_a(n)$ are the estimation errors, each of which has a 
Gaussian distribution. Define the state vector of the target VO as 
$x_i(n) = [C(n), \dot{C}(n), \ddot{C}(n)]^T$ and the corresponding process error vector as 
$v_i(n) = [\eta_{pi}(n), \eta_{vi}(n), \eta_{ai}(n)]^T$; hence the state equation can be expressed as 

$x_i(n+1) = F x_i(n) + v_i(n)$, (EQU 30) 

and the observation vector can be modeled as 

$z_i(n+1) = H x_i(n+1) + m_i(n+1)$, (EQU 31) 

where $F = \begin{pmatrix} 1 & T & T^2/2 \\ 0 & 1 & T \\ 0 & 0 & 1 \end{pmatrix}$ and $H = \begin{pmatrix} 1 & 0 & 0 \end{pmatrix}$. 

With the derived state-space equations given by (30) and (31), the standard Kalman filter 
is applied to give smoothed motion trajectories. 
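
A minimal sketch of the standard Kalman recursion for this state-space model is given below for one coordinate of the trajectory (the x and y coordinates can be filtered independently). The noise variances `q` and `r`, the initial state and the initial covariance are assumptions chosen for illustration; the specification does not state them.

```python
def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def kalman_smooth(observations, T=1.0, q=1e-3, r=1.0):
    """Filter a 1-D sequence of centroid coordinates.

    State is [position, velocity, acceleration]; F and H follow the text.
    q (process noise variance) and r (observation noise variance) are assumed.
    """
    F = [[1.0, T, 0.5 * T * T], [0.0, 1.0, T], [0.0, 0.0, 1.0]]
    Ft = transpose(F)
    x = [observations[0], 0.0, 0.0]          # assumed initial state
    P = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    estimates = [x[0]]
    for z in observations[1:]:
        # Prediction: x = F x, P = F P F^T + Q with Q = q * I.
        x = [sum(F[i][k] * x[k] for k in range(3)) for i in range(3)]
        P = mat_mul(mat_mul(F, P), Ft)
        for i in range(3):
            P[i][i] += q
        # Update with H = [1, 0, 0], so only the position is observed.
        s = P[0][0] + r
        gain = [P[i][0] / s for i in range(3)]
        residual = z - x[0]
        x = [x[i] + gain[i] * residual for i in range(3)]
        P = [[P[i][j] - gain[i] * P[0][j] for j in range(3)] for i in range(3)]
        estimates.append(x[0])
    return estimates


noisy = [0.0, 1.2, 1.9, 3.1, 3.8, 5.2, 6.1, 6.9]
print([round(v, 2) for v in kalman_smooth(noisy)])
```

Larger values of `r` relative to `q` make the output follow the kinematic model more closely and the raw centroids less closely.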



Part D. Curve Recognition Using Evolutionary Alignment with 
Concave-Gap Penalty and Complex Scoring Matrix Technique 

In this part, we introduce a generic approach for aligning 
two given curves under matching and quantitatively measuring their degree of similarity. 
The term "curve" here is a generic representation of the result of tracing the boundary of a 
shape, drawing a simple sketch, writing a character or symbol in one continuous stroke, or 
any similar information-generation process or operation. All one-stroke 
handwriting curves are first represented by a chosen chain-coding scheme. The resulting 
chain codes, as strings, are considered a special representation describing the 
individual curves. To match a pair of curves, their chain-code strings are aligned, 
compared, and measured. 

The evolutionary alignment algorithm is used to quantitatively measure the 
similarity between two curves described by their chain codes. If two curves are quite 
similar to each other, most of their chain codes will match, and the remaining chain 
codes can be altered to match by inserting a code, deleting a code, or replacing a code 
by another. Each of these operations incurs a score that contributes to the 
final matching score, or similarity score (SS), as follows. 

Given two strings of curves, $A = a_1 a_2 \dots a_M$ and $B = b_1 b_2 \dots b_N$, curve A can be 
matched to curve B by means of three possible operations: (1) deleting k 
consecutive codes, (2) inserting k consecutive codes, and (3) replacing a code by another. 
For each of these symbol operations, a corresponding scoring method is 
designed. For example, a positive score can be assigned to a perfect match, i.e., an unchanged 
replacement. The SS is the final score resulting from matching curve A 
against curve B by performing these three symbol operations. That is, the SS is a 
quantitative measurement of the degree of similarity between curves A and B. 
Two curves are considered to be quite similar to each other if the value of SS is high; 
the higher the value, the greater the similarity. 

One constant or function is used for the cost of opening up a gap, and one constant or 
function for the cost of inserting or deleting a code. For example, two negative 
constants, g and h, are introduced to establish an affine function 

$gap(k) = g + hk$, (1) 

for the penalty incurred in inserting or deleting k symbols. Opening up a gap costs 
score g, and each symbol inserted into or deleted from the gap costs an additional score 
h, giving a penalty score of hk for k symbols. A gap means either that a set of k symbols 
from string A is deleted or that a set of k symbols from string B is inserted. 

Replacement costs are specified by a scoring matrix $d(a_i, b_j)$, which gives the cost 
of replacing code $a_i$ by code $b_j$. Note that a code of A remains unchanged if it is replaced 
by itself (i.e., when the two codes $a_i$ and $b_j$ are perfectly matched), in which case it gains the highest score. 
Usually, $d(a_i, b_j) > 0$ if $a_i = b_j$, and $d(a_i, b_j) < 0$ if $a_i \ne b_j$. For example, in the 
application of handwritten character recognition using the 8-directional chain-code 
encoding method: 

$d(a_i, b_j) = \begin{cases} 4, & \text{if } a_i = b_j; \\ -3, & \text{otherwise.} \end{cases}$ (2) 
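
How the affine gap penalty of (1) and the scoring matrix of (2) combine into a similarity score can be illustrated with a standard three-matrix dynamic program for global alignment with affine gaps. This is a sketch of that well-known technique rather than the evolutionary alignment of this part; the values g = -4 and h = -1 are arbitrary assumptions, and chain codes are represented as characters of a string.

```python
NEG_INF = float("-inf")


def substitution_score(a, b):
    """Scoring matrix of equation (2): 4 for a match, -3 otherwise."""
    return 4 if a == b else -3


def similarity_score(A, B, g=-4, h=-1):
    """Best global alignment score with gap(k) = g + h*k (g, h assumed values).

    M[i][j]: best score with A[i-1] aligned to B[j-1]
    X[i][j]: best score ending with A[i-1] deleted (gap in B)
    Y[i][j]: best score ending with B[j-1] inserted (gap in A)
    """
    m, n = len(A), len(B)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    X = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Y = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):
        X[i][0] = g + h * i
    for j in range(1, n + 1):
        Y[0][j] = g + h * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = substitution_score(A[i - 1], B[j - 1])
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + d
            # Opening a gap costs g + h; extending an open gap costs h.
            X[i][j] = max(M[i - 1][j] + g + h, Y[i - 1][j] + g + h, X[i - 1][j] + h)
            Y[i][j] = max(M[i][j - 1] + g + h, X[i][j - 1] + g + h, Y[i][j - 1] + h)
    return max(M[m][n], X[m][n], Y[m][n])


print(similarity_score("0123", "0123"))  # 16: four matches
print(similarity_score("0123", "013"))   # 7: three matches and one gap of length 1
```

Identical chain-code strings obtain the highest possible score, and each gap or mismatch lowers it, which matches the interpretation of SS given above.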





Irregular MV density q      Suggested $W_{d1} \times W_{d1}$ 

0% < q < 2%                 No filtering 
2% < q < 15%                3 x 3 
15% < q < 30%               5 x 5 
30% < q < 45%               7 x 7 
45% < q < 60%               9 x 9 
60% < q < 70%               11 x 11 

Table 1: Suggested window sizes for various estimated values of parameter q. 
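
Table 1 can be applied as a simple range lookup. In the sketch below, the treatment of values that fall exactly on a boundary (assigned to the lower range) and the error raised above 70% are assumptions, since the table does not specify them; the function name is hypothetical.

```python
def suggest_window_size(q_percent):
    """Return the suggested SVM window width for irregularity q (in percent).

    Returns None where Table 1 suggests no filtering. Boundary values are
    assigned to the lower range, which is an assumption.
    """
    table = [(2, None), (15, 3), (30, 5), (45, 7), (60, 9), (70, 11)]
    for upper, size in table:
        if q_percent <= upper:
            return size
    raise ValueError("Table 1 gives no suggestion above 70% irregularity")


print(suggest_window_size(10))   # 3
print(suggest_window_size(50))   # 9
```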

Events    $B_1$     $B_2$     $B_3$ 

$A_1$     Case 1    Case 1    Case 1 
$A_2$     Case 2    Case 2    Case 2 
$A_3$     Case 3    Case 3    Case 4 

Table 2: Actions to be taken for various combinations of events (A, B). 



Events    $D_1$     $D_2$     $D_3$ 

$C_1$     Case 5    Case 5    Case 5 
$C_2$     Case 6    Case 6    Case 6 
$C_3$     Case 7    Case 8    Case 9 

Table 3: Actions to be taken for various combinations of events (C, D). 

While preferred embodiments of the present invention have been shown and 
described, it will be understood by those skilled in the art that various changes or 
modifications can be made without departing from the scope of the present invention. 



